Abstract

Current processor allocation techniques for highly 
parallel systems have thus far been restricted to con- 
tiguous allocation strategies for which performance 
suffers significantly due to the inherent problem of 
fragmentation. We are investigating processor allo- 
cation algorithms which lift the restriction on conti- 
guity of processors in order to address the problem 
of fragmentation. Three non-contiguous processor
allocation strategies, Naive, Random, and the Multiple
Buddy Strategy (MBS), are proposed and studied
in this paper. Simulations compare the performance
of the non-contiguous strategies with that of several
well-known contiguous algorithms. We show that
non-contiguous allocation algorithms perform better overall
than the contiguous ones, even when message-passing
contention is considered. We also present the results
of experiments on an Intel Paragon XP/S-15 with 208
nodes that show non-contiguous allocation is feasible
with current technologies.



1 Introduction 

Highly parallel systems have the promise of outper- 
forming traditional vector supercomputers in terms of 
price/performance on a wide range of individual ap- 
plications. However, mainstream computing does not 
run "individual applications," but instead, supports a 
workload that is a diverse mix of large and small jobs. 
With respect to overall system utilization, traditional
supercomputers are far ahead of parallel systems, the 
former usually achieving 98-99% utilization. A major 
price/performance obstacle to highly parallel systems 
is efficiently supporting workloads. 

The processor allocation problem involves the de- 
sign of algorithms for allocating a set of processors 
to a given parallel job with the goal of maximizing 
throughput over a stream of many jobs. Allocation 
techniques used in current commercial parallel ma- 
chines, as well as in the research community, have 
thus far been restricted to contiguous allocation in 
which the processors are constrained to be physically 
adjacent. In addition, many systems also require that 
the allocated processors form a subgraph of the origi- 
nal architecture, specifically, subcube allocation in hy- 
percubes and submesh allocation in meshes. The re- 
curring theme found in all these studies is that perfor- 
mance suffers significantly due to internal and external 
fragmentation, or due to high overheads for allocation 
and deallocation. Internal fragmentation occurs 
when more processors are allocated to a job than it 
requests. External fragmentation exists when a 
sufficient number of processors are available to satisfy 
a request, but they cannot be allocated contiguously. 
Experimental evidence has shown that little improve- 
ment in performance can be realized by refinements 
of contiguous allocation algorithms [5]. As a result,
recent research efforts have focused on the choice of 
scheduling policies and their impact on contiguous al- 
location schemes. 

Our research takes a different approach to overcom- 
ing the limitations of contiguous allocation. We are in- 
vestigating processor allocation algorithms which lift 
the restriction on contiguity of processors in order to 
address the problem of fragmentation. As we shall
show, non-contiguous allocation offers several significant
advantages over contiguous schemes: elimination
of internal and external fragmentation; low allocation
and deallocation overheads; compatibility with
adaptive processor allocation schemes [10] in which a 
job may increase or decrease its allocation at runtime; 
and straightforward extensions for fault tolerance. 

Current communication technologies like wormhole
routing enable us to consider non-contiguous alloca- 
tion, since the delay due to the number of hops be- 
tween processors is known to be negligible. However, 
we also note that non-contiguous allocation introduces 
potential problems due to message contention because 
the messages occupy more links, yielding potential 
communication interference with other jobs. There- 
fore, the most successful allocation scheme may be 
a hybrid between contiguous and non-contiguous ap- 
proaches. 

We compare the performance of three non-contiguous
processor allocation strategies, namely Naive, the
Multiple Buddy Strategy (MBS), and Random allocation,
with three well-known contiguous allocation
schemes: Frame Sliding, First Fit, and Best Fit. The
strategies we present represent a continuum with respect
to degree of contiguity. These strategies are also
directly applicable to processor allocation in k-ary
n-cubes, which include the hypercube and torus.

Section 2 gives a brief summary of previous work in 
the area of processor allocation for mesh topologies. 
Section 3 presents the results of preliminary experi- 
ments on an Intel Paragon XP/S-15 with 208 compute 
nodes. Section 4 describes the three non-contiguous 
schemes and discusses the Multiple Buddy Strategy, 
an algorithm we have developed that exploits the
advantages of non-contiguity with respect to fragmentation
while addressing the potential contention that
may be introduced. Section 5 analyzes the perfor- 
mance of these strategies through simulation results. 
Section 6 summarizes our results and discusses future 
work. 



2 Previous Research Work 

The Multiple Buddy Strategy proposed in this paper
is an extension of the 2-D Buddy Strategy. Our
simulations compare the performance of MBS with
Frame Sliding, First Fit, and Best Fit. The Krueger
paper [5] describes the performance limitations of all
contiguous allocation schemes and thus motivates our 
investigation of non-contiguous approaches. 



The two-dimensional buddy strategy, a generaliza- 
tion of the one-dimensional binary buddy system for 
memory management, is proposed by Li and Cheng 
[6] for a mesh connected system. Under this strategy, 
all incoming jobs are given square submeshes of size
$n' \times n'$ and the system itself is a square mesh of size
$n \times n$, where both $n'$ and $n$ are exact powers of 2. The
allocation and deallocation overheads of this strategy
are both $O(\log n)$, which is low compared to
other strategies. However, the strategy has three
drawbacks: it can only be applied to square meshes;
it suffers from severe internal fragmentation, because
a square submesh with a side length of $2^i$ is always
required; and it has significant external
fragmentation. The Intel Paragon uses an extension
of the 2-D buddy strategy that is applicable to non-square
meshes and allows allocation across more than
one size of buddy [9].
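
The internal fragmentation of the 2-D buddy strategy can be
illustrated with a short calculation. The sketch below is not
taken from [6]; it only assumes, as stated above, that a job
receives the smallest square submesh whose side length is a power
of 2 and which holds the requested processors. The function names
are illustrative.

```python
def buddy_side(k: int) -> int:
    """Smallest power-of-two side length whose square holds k processors."""
    side = 1
    while side * side < k:
        side *= 2
    return side


def internal_fragmentation(k: int) -> int:
    """Processors granted by a 2-D buddy allocation but not requested."""
    side = buddy_side(k)
    return side * side - k


if __name__ == "__main__":
    for k in (4, 5, 16, 17):
        print(k, buddy_side(k), internal_fragmentation(k))
```

A request for 5 processors, for example, receives a 4 x 4 submesh
and leaves 11 processors idle, and a request for 17 processors
receives an 8 x 8 submesh.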

Chuang and Tzeng proposed an improved strategy 
called the frame sliding strategy [3]. It is applicable to 
any mesh system and any shape of submesh request, 
thus it has no internal fragmentation. The frame slid- 
ing strategy examines the first candidate "frame" from 
the lowest leftmost available processor and slides the 
candidate frame horizontally or vertically by the stride 
of width or height of the requested submesh, respec- 
tively, until an available frame is found, or all can- 
didate frames are checked. This strategy has better 
performance than the 2-D buddy strategy. However, 
it has higher allocation overhead O(n), it suffers from 
large external fragmentation, and it cannot recognize 
all possible free submeshes. 

In [13], Zhu proposed the first fit and best fit strate- 
gies, which can be applied for contiguous submesh re- 
quests of arbitrary sizes and have the ability to recog- 
nize all free submeshes in a system. These algorithms 
locate submeshes by constructing bit arrays indicat- 
ing which processors have enough free neighbors to 
host the base node, the lower-left processor. These 
bit arrays can then be searched for the first available 
submesh (first fit) or submesh that best fits the re- 
quest (best fit). Both strategies suffer from significant 
external fragmentation. These algorithms both have 
allocation and deallocation overhead of O(n). 
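
To make the notion of a base node concrete, the following sketch
searches for the first free submesh by brute force. It is a
simplification under stated assumptions and not the bit-array
construction of [13]: the mesh is a list of rows, free[y][x] is
True for an unallocated processor, and candidate base nodes are
scanned row by row from the lower-left corner.

```python
from typing import List, Optional, Tuple


def first_fit(free: List[List[bool]], w: int, h: int) -> Optional[Tuple[int, int]]:
    """Return the base (lower-left) node of the first free w x h submesh.

    free[y][x] is True when processor (x, y) is unallocated. Returns None
    when no contiguous w x h submesh is entirely free.
    """
    rows, cols = len(free), len(free[0])
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            # (x, y) can host the base node only if every processor
            # in the w x h frame above and to the right of it is free.
            if all(free[y + j][x + i] for j in range(h) for i in range(w)):
                return (x, y)
    return None


if __name__ == "__main__":
    mesh = [[True] * 4 for _ in range(4)]
    mesh[1][1] = False              # one busy processor at (1, 1)
    print(first_fit(mesh, 2, 2))    # a 2 x 2 request still fits
    print(first_fit(mesh, 3, 3))    # a 3 x 3 request does not
```

In the usage example, 15 of the 16 processors are free, yet a
3 x 3 request for 9 processors cannot be placed. This is external
fragmentation in the sense defined in Section 1.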

Krueger et al. have shown in [5] that increasingly
sophisticated processor allocation algorithms do not 
significantly influence the performance of hypercube 
systems. Their simulations of four well-known hy- 
percube allocation strategies realized limited improve- 
ments despite the differing abilities of these algorithms 
to reduce fragmentation and recognize available subcubes.
The barriers observed by Krueger et al. are
primarily a direct result of external fragmentation,
which arises from the contiguity constraint. Although 
no statistics have been compiled for mesh systems, we 
believe the same trend will be exhibited under the as- 
sumption of contiguity. Thus, improved performance 
requires exploration of other alternatives, including 
scheduling policies [2] [8] [11] and the approach we
propose: non-contiguous allocation.

3 Contention on Real Systems 

The potential increased message contention caused 
by non-contiguous allocation could greatly reduce performance.
As a first step in evaluating the feasibility
of non-contiguous allocation, we measured worst-case 
contention and ran benchmarks on the Intel Paragon 
XP/S-15 at the Numerical Aerodynamic Simulation 
(NAS) facility at NASA Ames Research Center. The 
NAS Paragon is a distributed memory multicomputer 
with 208 compute nodes connected by a 175 megabyte 
per second bi-directional mesh with wormhole XY
routing. In addition to using the operating system 
supplied by Intel, Paragon OS release 1.1, we ran 
worst-case contention tests under SUNMOS, a min- 
imal operating system developed by Sandia National 
Labs and the University of New Mexico. 

One would expect message contention to have a no- 
ticeable impact on performance; however, we were un- 
able to measure any performance degradation on real 
applications due to contention. We ran two, four, and 
six copies of selected NAS Parallel Benchmarks [1] si- 
multaneously and found no measurable performance 
difference compared with running the benchmarks one 
at a time on a dedicated system. For example, we par- 
titioned the compute nodes based on diagonals, split- 
ting the machine into four partitions (64, 64, 32, and 
32 nodes) in a checkerboard pattern — this pattern was 
designed to encourage message contention between ap- 
plications. We ran the best Intel-supplied implemen- 
tations of selected NAS Parallel Benchmarks (FFT, 
MG, and CG) in a loop on each partition. The largest 
performance variation was for FFT, which varied by
less than 2% over 13 runs. We saw similar results for 
other partitioning schemes. 

In order to better quantify the effects of message 
contention on the Paragon, we developed a simple 
worst-case contention-generating program, contend.
To force contention on the XY routed mesh of the 
Paragon, we allocated the nodes on the north and east 
edges of the mesh. Nodes were paired from the middle
outward, and each pair exchanged messages. With
this configuration, all messages must traverse one common
network link. We ran contend on up to nine pairs
of simultaneously communicating nodes, with message
sizes ranging from 0 to 64 kilobytes.

Figure 1: Worst Case Contention on the Intel Paragon
(Paragon OS R1.1)
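
The effect of this node placement can be checked with a small
model of XY routing. The sketch below is an illustration under
several assumptions that are not stated in the text: the mesh is
taken to be 16 x 13 (208 nodes), pair i consists of the node i
steps west of the north-east corner and the node i steps south of
it, and only the messages sent from the north-edge node to the
east-edge node are modeled.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]


def xy_route(src: Node, dst: Node) -> List[Link]:
    """Directed links used by dimension-ordered routing: X first, then Y."""
    x, y = src
    links: List[Link] = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def shared_links(width: int, height: int, pairs: int) -> Set[Link]:
    """Links common to the north-to-east messages of every pair."""
    routes = []
    for i in range(1, pairs + 1):
        north = (width - 1 - i, height - 1)   # i steps west of the corner
        east = (width - 1, height - 1 - i)    # i steps south of the corner
        routes.append(set(xy_route(north, east)))
    return set.intersection(*routes)


if __name__ == "__main__":
    print(shared_links(16, 13, 9))
```

Under these assumptions, every modeled message enters the corner
node over the same link and leaves it over the same link, so the
corner links become the bottleneck as pairs are added.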

Under the native Paragon operating system (Figure 
1), virtually no contention is noticeable (RPC times 
are flat) through six pairs of communicating nodes. 
Starting with seven pairs, contention begins to slow 
message-passing performance, but only for messages
larger than 16 kilobytes. This surprising result is an
artifact of the current release of the operating system,
and explains our inability to measure contention
while running multiple simultaneous copies of the NAS
benchmarks. Although the Paragon hardware supports
175 megabytes per second bandwidth, the current
release of the operating system (R1.1) delivers
only about 30 megabytes per second. The hardware
therefore has enough bandwidth to carry
about six pairs of communicating nodes without any
noticeable contention (6 x 30 = 180 megabytes per
second, close to the 175 the hardware provides).

We then ran contend on the Paragon under the 
SUNMOS operating system (Figure 2), which deliv- 
ers 170 megabytes per second bandwidth, nearly peak 
speed. With the anomalous operating system behav- 
ior eliminated, the effects of contention are significant 
with only two pairs of communicating nodes, and in- 
crease linearly with the number of pairs. However, 
small messages (less than one kilobyte) appear to be 
little affected by contention, even with nine pairs of
communicating nodes. 

Figure 2: Worst Case Contention on the Intel Paragon
(SUNMOS S1.0.94)

Although preliminary, and by no means exhaustive,
our experiments show two interesting facts about
contention on the Paragon: current operating system
overhead subsumes contention effects, and small messages
do not cause contention. The poor operating
system performance will likely be corrected in future 
releases of the Paragon OS, but small messages will
likely remain largely unaffected by contention. Van
Voorst et al. [12] measured the workload of the Intel iPSC/860
system at NAS for ten days, and found that 87% of all 
messages are, in fact, one kilobyte or less. So, at least 
for a class of scientific applications, large messages 
may not be a significant issue. This empirical data 
is encouraging, and supports the notion that a purely 
non-contiguous allocation strategy may run into con- 
tention effects with large messages, but a purely con- 
tiguous strategy is also unnecessary. 



4 Non-contiguous Allocation 

4.1 Random and Naive Strategies 

One of the most straightforward non-contiguous allocation
strategies is the Random allocation strategy,
under which a request for k processors is satisfied with
k randomly selected processors. Both internal and
external fragmentation are eliminated,
since every job is assigned exactly the requested number
of processors whenever that many are available. No
contiguity is enforced under this strategy. Another simple
non-contiguous allocation strategy is the Naive allocation
strategy, under which a request for k processors is satisfied
by the first k free processors in a row-major scan of
the mesh. Some degree of contiguity is maintained
through the nature of the row-major scan. As with
the Random allocation strategy, there is neither
internal nor external fragmentation. The complexity
of the allocation and deallocation algorithms for both
of these strategies is O(k).

Figure 3: Eliminating system fragmentation using
MBS
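
The Random and Naive strategies can be sketched in a few lines.
The code below is a minimal illustration, not the implementation
used in our simulator: it assumes the mesh is a list of rows with
free[y][x] set to True for an unallocated processor, and it rescans
the whole mesh on every call for simplicity, so each call costs
O(n) here rather than the O(k) that a maintained free list would
give. All names are illustrative.

```python
import random
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def free_processors(free: List[List[bool]]) -> List[Coord]:
    """Unallocated processors in row-major order."""
    return [(x, y) for y, row in enumerate(free) for x, ok in enumerate(row) if ok]


def _claim(free: List[List[bool]], chosen: List[Coord]) -> List[Coord]:
    """Mark the chosen processors as allocated and return them."""
    for x, y in chosen:
        free[y][x] = False
    return chosen


def naive_allocate(free: List[List[bool]], k: int) -> Optional[List[Coord]]:
    """Take the first k free processors found by a row-major scan."""
    candidates = free_processors(free)
    if len(candidates) < k:
        return None  # not enough processors: the job must wait
    return _claim(free, candidates[:k])


def random_allocate(free: List[List[bool]], k: int, rng=random) -> Optional[List[Coord]]:
    """Take k free processors chosen uniformly at random."""
    candidates = free_processors(free)
    if len(candidates) < k:
        return None
    return _claim(free, rng.sample(candidates, k))


if __name__ == "__main__":
    mesh = [[True] * 4 for _ in range(4)]
    print(naive_allocate(mesh, 5))
    print(random_allocate(mesh, 5, random.Random(1)))
```

Both functions fail only when fewer than k processors are free,
which is why neither strategy exhibits external fragmentation.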

4.2 Multiple Buddy Strategy 

In the following, we introduce another non- 
contiguous allocation strategy, which we call the Mul- 
tiple Buddy Strategy (MBS). It is an extension of the 
2-D buddy strategy [6], which has both internal and external
fragmentation problems. MBS eliminates fragmentation
by applying the non-contiguous model to
the mesh system, while still maintaining contiguity 
within individual blocks. A more detailed and formal 
discussion of MBS may be found in [7]. 

The following scenarios show the problems exhibited
by the 2-D buddy strategy and how they are resolved
by the MBS strategy. Square submeshes are
represented by <x, y, s>, in which <x, y> is the
location of the lower leftmost processor and s is the
side length of the s x s submesh.

In Fig 3(a), a mesh of size $2^3 \times 2^3$ has three allocated
submeshes (represented by the black squares):
<0, 0, 2>, <4, 0, 1>, and <4, 4, 1>. Assume that
a job which needs 5 processors is submitted to the system.
Since all submeshes have to be $2^i \times 2^i$ under the
2-D buddy strategy, a 4 x 4 submesh (<0, 4, 4>) will
be allocated. In this case 11 processors in the submesh
will be wasted during the lifetime of the job. Under
the MBS strategy, the exact number of processors will
be assigned to the job. In the above case, two blocks
will be assigned to the job: <2, 0, 2> and <5, 0, 1>.
Therefore, the internal fragmentation is eliminated.
Assume that the mesh shown in Fig 3(b) receives a
request for 16 processors. Since a 4 x 4 block cannot 
be found in the mesh using the 2-D buddy strategy, 
the request will be put into a waiting queue, resulting 
in external fragmentation. The MBS strategy resolves 
this problem by breaking a large request into smaller 
blocks. In the above case, 4 blocks of size 2x2 will be 
assigned to the job. Since larger requests can always 
be broken down to 1 x 1 blocks, there will not be any 
external fragmentation. 

Under MBS a request for k processors is represented
as a base 4 number of the form
$k = d_m \times (2^m \times 2^m) + \cdots + d_0 \times (2^0 \times 2^0)$.
MBS attempts to satisfy this request using
blocks of sizes $2^m \times 2^m, \ldots, 2^0 \times 2^0$. If a block of a
desired size is unavailable, MBS searches for a bigger
block, which it repeatedly breaks down into buddies
until it produces blocks of the desired size. If that
fails, MBS breaks a request for a block of size $2^i \times 2^i$
into four requests for blocks of size $2^{i-1} \times 2^{i-1}$. This
process continues until the original request for k processors
is satisfied. The proposed MBS strategy can eliminate
the fragmentation problems and be implemented efficiently.
It is composed of the following 5 parts: system
initialization, request factoring algorithm, buddy generating
algorithm, allocation algorithm, and deallocation
algorithm.

4.2.1 System Initialization 

System initialization is done only once at system 
startup time. At this time, the whole mesh system is 
divided into initial blocks, which are non-overlapped 
square submeshes with side lengths that are exactly 
powers of 2. The initialization process allows the strat- 
egy to be applicable to any size mesh system. 

A block and its buddies are defined recursively as
follows. Any initial block is a block. Each block is
a square mesh represented by <x, y, p>, where
<x, y> is the location of the lower leftmost processor
and p is the side length of the submesh. If <x, y, p>
is a block and p > 1, then <x, y, p/2>, <x + p/2, y, p/2>,
<x, y + p/2, p/2>, and <x + p/2, y + p/2, p/2> are blocks,
and they are buddies of each other.
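
The recursive definition translates directly into code. The
function below is a small sketch with an illustrative name; it
represents a block as a tuple (x, y, p) and returns the four
buddies produced by splitting it once.

```python
from typing import List, Tuple

Block = Tuple[int, int, int]  # (x, y, side length p)


def buddies(x: int, y: int, p: int) -> List[Block]:
    """Split block <x, y, p> into its four buddies of side p/2."""
    if p <= 1:
        raise ValueError("a 1 x 1 block cannot be split")
    half = p // 2
    return [
        (x, y, half),                # lower left
        (x + half, y, half),         # lower right
        (x, y + half, half),         # upper left
        (x + half, y + half, half),  # upper right
    ]


if __name__ == "__main__":
    print(buddies(0, 4, 4))
```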

The concept of free block records (FBR) extends the
notion of the free block lists in the 2-D buddy strategy.
FBR[i] records the number (FBR[i].block_num)
of available blocks of size $2^i \times 2^i$ and an ordered list
(FBR[i].block_list) of the locations of such blocks. At
startup time, information about the initial blocks is
kept in the FBRs, where FBR[i].block_num keeps the
number of $2^i \times 2^i$ initial blocks and FBR[i].block_list
keeps a location list of such initial blocks. Another
global variable, AVAIL, the current number of available
processors in the system, is initialized to the
number of processors in the mesh system during
startup.

4.2.2 Request Factoring Algorithm 

Any integer has a base 4 representation, expressible
as a sum $\sum_{i=0}^{\lfloor \log_4 n \rfloor} d_i \times (2^i \times 2^i)$, where $0 \le d_i \le 3$.
Thus any legal job request can be accommodated by
$d_i$ blocks of size $2^i \times 2^i$ for each $i$. At most
$\lceil \log_4 n \rceil$ distinct block sizes are needed, with a
maximum of 3 blocks of a given size.

We define the maximum distinct blocks (MaxDB) of
a given mesh system as $\lceil \log_4 n \rceil$, where n is the number
of processors in the system. The factoring algorithm
takes as input the job size and produces as
output a request array (Request_Array[0..MaxDB]),
where Request_Array[i] stands for the number of
$2^i \times 2^i$ blocks that the job needs. The algorithm
is essentially an integer conversion algorithm, where
Request_Array[i] is the ith digit in the base 4 integer
representation of the job size.
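
Since the factoring step is a base 4 conversion, it can be sketched
as follows. The function name and the error handling are
illustrative choices; the returned list plays the role of
Request_Array, with entry i holding the number of blocks of side
2**i.

```python
from typing import List


def factor_request(k: int, max_db: int) -> List[int]:
    """Return the base-4 digits of k, least significant digit first."""
    request = [0] * (max_db + 1)
    for i in range(max_db + 1):
        request[i] = k % 4   # number of blocks with side length 2**i
        k //= 4
    if k:
        raise ValueError("job size does not fit in max_db + 1 base-4 digits")
    return request


if __name__ == "__main__":
    # 5 processors = one 2 x 2 block plus one 1 x 1 block
    print(factor_request(5, 3))
    # 16 processors = one 4 x 4 block
    print(factor_request(16, 3))
```

For any job size, the sum of request[i] * 4**i over all i equals
the job size, so the factored request never asks for more
processors than the job needs.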

4.2.3 Buddy Generating Algorithm 

The buddy generating algorithm breaks a large block
into several smaller blocks to satisfy the $2^i \times 2^i$ requests.
It contains two phases. In the first phase,
an available block is sought by examining the FBRs
in increasing order of block size from $2^{i+1} \times 2^{i+1}$ to
$2^{max} \times 2^{max}$. During the second phase, the block is
repeatedly broken down into smaller buddies until
blocks of the desired size are produced. If no block is found
in the search phase, as we shall see, the allocation
algorithm will break the request down into smaller requests.

4.2.4 Allocation and Deallocation Algorithm 

The allocation algorithm includes two main parts.
First, the request is factored and stored in
Request_Array[i]. Second, each request for a block
of size $2^i \times 2^i$ is, if possible, allocated immediately
from FBR[i]. Otherwise, an attempt is made to satisfy this
request from a larger block by breaking it into smaller buddies.
If that fails, the request is broken down into 4
smaller requests, which are stored in Request_Array[i-1].
By the above algorithm, job requests are satisfied
with the exact number of processors, and large
requests can be accommodated by available smaller
blocks; hence we can conclude that the Multiple
Buddy Strategy suffers from neither internal nor external
fragmentation.

The deallocation procedure is essentially the same 
as that of the 2-D buddy strategy. Instead of returning 
just one block to the system, the MBS strategy needs 
to return all blocks owned by the job to the mesh 
system, and merge the buddies up to restore the larger 
blocks. 
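
The allocation and deallocation steps can be put together in a
compact sketch. It is a simplified illustration, not the formal
algorithm of [7]: it assumes a single square mesh whose side is a
power of 2 (one initial block), keeps each free block list in
arrival order rather than sorted, and uses illustrative names
throughout.

```python
from typing import List, Optional, Tuple

Block = Tuple[int, int, int]  # (x, y, side length)


def factor_request(k: int, max_level: int) -> List[int]:
    """Base-4 digits of k; digit i counts blocks of side 2**i."""
    digits = []
    for _ in range(max_level + 1):
        digits.append(k % 4)
        k //= 4
    return digits


class MultipleBuddy:
    """MBS sketch for one 2**m x 2**m mesh."""

    def __init__(self, side: int) -> None:
        self.max_level = side.bit_length() - 1
        # fbr[i] lists the lower-left corners of free blocks of side 2**i.
        self.fbr: List[List[Tuple[int, int]]] = [[] for _ in range(self.max_level + 1)]
        self.fbr[self.max_level].append((0, 0))
        self.avail = side * side

    def _take(self, level: int) -> Optional[Tuple[int, int]]:
        """Pop a free block of side 2**level, splitting a larger one if needed."""
        for big in range(level, self.max_level + 1):
            if self.fbr[big]:
                x, y = self.fbr[big].pop(0)
                for lv in range(big - 1, level - 1, -1):
                    half = 1 << lv
                    # Keep (x, y) for further splitting; free its three buddies.
                    self.fbr[lv] += [(x + half, y), (x, y + half), (x + half, y + half)]
                return (x, y)
        return None

    def allocate(self, k: int) -> Optional[List[Block]]:
        if k <= 0 or k > self.avail:
            return None  # the job would wait in the queue
        request = factor_request(k, self.max_level)
        blocks: List[Block] = []
        for level in range(self.max_level, -1, -1):
            for _ in range(request[level]):
                origin = self._take(level)
                if origin is None:
                    # No block of this size can be produced:
                    # replace the request by four smaller ones.
                    request[level - 1] += 4
                else:
                    blocks.append((origin[0], origin[1], 1 << level))
        self.avail -= k
        return blocks

    def deallocate(self, blocks: List[Block]) -> None:
        for x, y, side in blocks:
            self.avail += side * side
            level = side.bit_length() - 1
            while level < self.max_level:
                half = 1 << level
                px, py = x - x % (2 * half), y - y % (2 * half)
                others = [(px + dx, py + dy) for dy in (0, half) for dx in (0, half)
                          if (px + dx, py + dy) != (x, y)]
                if not all(b in self.fbr[level] for b in others):
                    break
                for b in others:       # all three buddies are free: merge
                    self.fbr[level].remove(b)
                x, y, level = px, py, level + 1
            self.fbr[level].append((x, y))


if __name__ == "__main__":
    mesh = MultipleBuddy(8)
    print(mesh.allocate(5))   # one 2 x 2 block plus one 1 x 1 block
```

Because a request is only rejected when fewer than k processors
are free, a 1 x 1 block can always be produced for any request
that is accepted, so the loop never needs to break a request
below level 0.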

Complexity for the allocation algorithm and the
deallocation algorithm in the worst case is O(n).
For allocation, the accumulated overhead of
generate_buddy is O(log n) and at most O(n) block entries
will be allocated, which has O(n) time overhead. For
deallocation, since the maximum number of buddy
merges is $\frac{n}{4} + \frac{n}{16} + \cdots + 1 \approx \frac{n}{3} = O(n)$, the overhead
for deallocation in the worst case will not exceed
O(n).
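
The bound on the number of merges can be made explicit. As a
worked example, assume a mesh of $n = 4^m$ processors that has been
broken down entirely into 1 x 1 blocks. Each merge combines four
buddies into one block of the next size, so the number of merges
at each level forms a geometric series:

```latex
\[
  \sum_{j=1}^{m} \frac{n}{4^{j}}
  = \frac{n}{4} + \frac{n}{16} + \cdots + 1
  = \frac{n - 1}{3}
  = O(n), \qquad n = 4^{m}.
\]
```

For a 32 x 32 mesh ($n = 1024$), this gives 256 + 64 + 16 + 4 + 1 =
341 merges in the worst case.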



5 Performance Analysis 

We conducted two distinct sets of simulation experiments
to analyze the performance of non-contiguous
allocation strategies compared to contiguous ones: (1)
fragmentation experiments and (2) message-passing
experiments. Our discrete event simulator was implemented
in C using the Rice Parallel Processing
Testbed tools YACSIM, a general simulation library,
and NETSIM, a library of network simulation extensions [4].

The fragmentation experiments model the arrival, 
service, and departure of a stream of jobs in a mesh- 
connected system using first-come, first-serve schedul- 
ing (FCFS). These high-level experiments focus on the 
effects of system fragmentation (both internal and ex- 
ternal). Thus, the overhead of allocation and deallo- 
cation is ignored in the simulation, and the message- 
passing behavior of the algorithms is not modeled. 

The message-passing experiments model the same 
stream of jobs, but at a much finer-grained level.
The detailed message-passing behavior in a mesh with
wormhole routing is simulated down to the level of individual
flits and message-passing buffers. The purpose
of these experiments is to carefully examine the
message contention introduced by non-contiguity.

5.1 Fragmentation Experiments 

The first set of experiments, studying the effects of
fragmentation on system utilization and job response
time, is modeled after the simulation experiments conducted
in previous allocation strategy research [13] [3] [5].
In these experiments, jobs arrive, delay for an
amount of time taken from an exponential distribution,
and then depart. Message passing is not modeled.

The contiguous allocation strategies simulated in
these experiments are First Fit, Best Fit [13], and
Frame Sliding [3]. Of the non-contiguous strategies,
we only present the results for the Multiple Buddy
Strategy, which performs identically to Random and
Naive with respect to system fragmentation. The job
request streams were modeled by taking the submesh request
sizes from the uniform, exponential, increasing,
and decreasing distributions. The independent variable
in these experiments was the system load, defined
as the ratio of the mean service time to the mean
interarrival time of jobs. Higher system loads reflect
the greater demands when jobs arrive faster than they
can be processed. For example, under a system load
of 1.0, jobs arrive as fast as they are serviced, on
average, and under a system load of 2.0, jobs arrive
twice as fast as they can be serviced. See [7] for more
simulation details.

For each job size distribution in these experiments, 
we measure: 

• Finish Time - the time required for completion of
all the jobs.

• System Utilization - the percentage of processors
that are utilized over time.

• Job Response Time - the time from when a job
arrives in the waiting queue until the time it completes.
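
The three measures can be computed from per-job records as in the
sketch below. It is one plausible reading of the definitions above
rather than the simulator's code: it assumes the simulation clock
starts at 0, that each record stores the number of processors the
job requested, and that utilization is busy processor-time divided
by total processor-time up to the finish time. The JobRecord type
is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    processors: int   # processors requested by the job
    arrival: float    # time the job entered the waiting queue
    start: float      # time the job began execution
    finish: float     # time the job completed


def finish_time(jobs: List[JobRecord]) -> float:
    """Time at which the last job completes."""
    return max(j.finish for j in jobs)


def system_utilization(jobs: List[JobRecord], total_processors: int) -> float:
    """Percentage of processor-time spent running jobs."""
    busy = sum(j.processors * (j.finish - j.start) for j in jobs)
    return 100.0 * busy / (total_processors * finish_time(jobs))


def mean_response_time(jobs: List[JobRecord]) -> float:
    """Average time from arrival in the queue to completion."""
    return sum(j.finish - j.arrival for j in jobs) / len(jobs)


if __name__ == "__main__":
    jobs = [JobRecord(8, 0.0, 0.0, 10.0), JobRecord(8, 0.0, 10.0, 20.0)]
    print(finish_time(jobs), system_utilization(jobs, 16), mean_response_time(jobs))
```

In the usage example, two 8-processor jobs run one after the other
on a 16-processor system, so half of the processor-time is idle.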

All simulations model a 32 x 32 mesh and run un- 
til 1000 jobs have been completed. Results reported 
for the fragmentation experiments represent the sta- 
tistical mean after 24 simulation runs with identical 
parameters, and given 95% confidence level, mean re- 
sults have less than 5% error. 

Table 1 shows how well the four algorithms handle 
a system saturated by job requests with job sizes taken 
from each distribution. Simulation results for a heavy 
system load of 10.0 are presented since, at this load, 
the system waiting queue is filled very early in the 
simulation, allowing each allocation strategy to reach 
its upper limits of performance. 

In all cases, the non-contiguous Multiple Buddy
Strategy performs much better than First Fit, Best
Fit, and Frame Sliding. With uniform, exponential,
and decreasing distributions, simulations using MBS
finish at least 57% faster than any of the other algorithms,
and very dramatic improvements are also made
in system utilization. Improvement is less dramatic,
though still significant, under the increasing distribution
because the large job sizes tend to degrade the
system towards the point where it can only service
one job at a time.

                   Job Size Distribution
Algorithm    Uniform    Expon.     Incr.      Decr.

Finish Time (simulation time units)
MBS          365.32     258.68     753.66     119.89
FF           582.01     429.57     882.94     237.90
BF           573.79     428.72     883.08     231.92
FS           608.02     457.88     885.56     267.40

System Utilization (percent)
MBS           72.39      69.36      70.18      77.32
FF            45.96      41.68      60.15      39.15
BF            45.70      41.64      60.30      39.28
FS            43.39      38.47      59.84      34.30

b f\ ip4] = 0.4, f\ b 8 , = = 0.2 t /^ 16 32J = 0.2

Table 1: Fragmentation experiment results: Finish
Time and System Utilization of each algorithm under
different job size distributions for a heavy system
load (10.0).

Figure 4 graphs the system utilization for these 
same algorithms and the uniform job size distribution 
at varying system loads. It shows that MBS can ac- 
commodate a much higher system load before becom- 
ing overloaded, and that the system utilization at this 
point is much higher. 

The results for contiguous allocation measured in 
these experiments are all consistent with those re- 
ported by Zhu in [13]. 

These fragmentation experiments indicate that 
non-contiguous allocation is far superior to contigu- 
ous in terms of its ability to utilize the processors. 
Because non-contiguous allocation can always allocate
a job if there are enough processors available,
eliminating external fragmentation, it achieves
higher system utilization. Thus, non-contiguous allocation
allows for greater job throughput. However,
these results ignore the increased communication contention
that may be introduced as a result of non-contiguous
allocation. Therefore, in order to validate
non-contiguous allocation as a viable strategy, experiments
must be performed to evaluate message-passing
performance.

Figure 4: Fragmentation experiment results: System
Utilization vs. System Load for the uniform distribution
of job size.

5.2 Message-passing Experiments

The second set of experiments measures message-passing
contention and its effects on overall performance.
The same simulator used in the fragmentation
experiments was extended to model the sending
and receiving of messages between the processors al- 
located to a job. Thus, rather than simply delaying 
for a given service time, processors allocated to the 
job communicate with each other according to a given 
communication pattern. The communication pattern 
iterates until the number of messages sent within the 
job has reached its message quota, a value taken from 
an exponential distribution. This quota ensures that 
the job service times are independent of the job sizes.
Once communication ceases, the job departs from the 
system and is deallocated. 

The interconnection network is modeled by XY 
routing switches. These routing switches are con- 
nected by two uni-directional channels to neighboring 
switches in the mesh and to the corresponding proces- 
sor elements. The flow control mechanism governing 
flit movement (flits are the smallest unit of data trans- 
mission in the network) is wormhole routing. Messages 
originate from a processor element and their flits tra- 
verse the network in pipeline fashion to their destina- 
tion processor. If the header flit of a packet is routed 
to a busy channel, that header flit and its trailing flits 
stop moving and block whichever channels they oc- 
cupy in the network. This results in packet blocking 
time, due to contention, which can be measured in the 
simulation. 
The message-passing experiments implement five
communication patterns: all-to-all broadcast, one-to-all
broadcast, the n-body computation, fast Fourier
transform (FFT), and multigrid (MG) from the NAS
parallel benchmarks. These cover many communication
patterns used very frequently by highly parallel
applications and provide a spectrum of message-passing
complexity ranging from $O(n)$ to $O(n^2)$.

For simplicity and consistency, the internal map- 
ping of the processes within each job is a row-major 
ordering of processors in each contiguously allocated 
block. This makes the latter three patterns very 
interesting cases, since the row-major mapping of
these patterns is well-suited to contiguous allocations. 
These cases will be examined in more detail below. 

Using this simulation model, experiments were con- 
ducted for each communication pattern using job 
streams generated from the uniform distribution. The 
network communication delay parameters were cho- 
sen such that the average job service times were great 
enough to result in high system loads, and thus, 
minimal system fragmentation. Experimental results 
are presented for Multiple Buddy Strategy, Random, 
Naive, and First Fit allocation. First Fit was chosen 
as a representative of contiguous allocation strategies 
since it has been shown to perform as well as the oth- 
ers [13]. See [7] for more simulation details. 

From each simulation run, we measure: 

• Finish Time - the time required for completion
of all the jobs. Finish time is a good measure of
overall performance.

• Service Time - the time from when a job begins
execution until the time it completes. Service
time includes the total communication latency for
the communication pattern executed.

• Packet Blocking Time - the time that a packet is
blocked in the network waiting for a channel to
become free. Packet blocking time is a measure
of the contention.

• Weighted Dispersal - the degree of non-contiguity
for an allocation, approximating the percentage of
links that are potential sources of contention.
Dispersal is defined as the number of unallocated
processors divided by the total number of processors
in the smallest rectangle circumscribing all
processors allocated to a specific job. The weighted
dispersal, then, is the job's dispersal multiplied
by the number of processors allocated to the job.
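
The dispersal definition above can be written out as a short sketch. It assumes that "unallocated" means "not allocated to this job" inside the bounding rectangle; the function names and the coordinate representation are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Tuple

Coord = Tuple[int, int]


def dispersal(allocated: Iterable[Coord]) -> float:
    """Fraction of the job's bounding rectangle not held by the job.

    Assumption: processors in the rectangle that belong to other jobs
    count as "unallocated" for the purposes of this measure.
    """
    cells = set(allocated)
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    box = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return (box - len(cells)) / box


def weighted_dispersal(allocated: Iterable[Coord]) -> float:
    """Dispersal multiplied by the number of processors in the job."""
    cells = set(allocated)
    return dispersal(cells) * len(cells)


# Three scattered processors span a 4 x 3 rectangle: 9 of 12 cells are unused.
scattered = [(0, 0), (1, 0), (3, 2)]
# A contiguous 2 x 2 block fills its rectangle completely.
contiguous = [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Under these assumptions the scattered job has dispersal 0.75 and weighted dispersal 2.25, while any contiguous rectangular allocation has dispersal 0, which matches the zero entries reported for First Fit.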



All message-passing simulations model a 16 x 16
mesh and run until 1000 jobs have been completed.
Results reported are the statistical mean over
10 simulation runs with identical parameters. At a
95% confidence level, the mean results have less
than 5% error, with the exception of service times,
which have less than 10% error.
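
One common way to check such an error bound is sketched below. The text does not state how the confidence intervals were computed, so the Student t interval used here is an assumption, and the sample finish times are made up for illustration.

```python
import math
import statistics
from typing import Sequence

# Two-sided 95% Student t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def relative_error(samples: Sequence[float], t_crit: float = T_CRIT_95_DF9) -> float:
    """Half-width of the confidence interval divided by the sample mean."""
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(len(samples))
    return half_width / mean


# Hypothetical finish times from 10 runs with identical parameters.
finish_times = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
error = relative_error(finish_times)
```

For these made-up samples the relative error is a little over 1%, well inside the 5% bound quoted above.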

Table 2(a) shows the results of simulations for jobs
executing the heavy All-to-All communication pattern.
As expected, contiguous allocation shows the
least amount of contention (as seen in the packet
blocking times). However, based on the overall finish
times, MBS and Naive significantly outperform the
other strategies. The reason is that, although they
suffer slightly more contention than contiguous
allocation, the improvements in system utilization still
outweigh the increased communication overhead. It
is also interesting to note that MBS and Naive
allocation result in only moderate dispersal when compared
to Random allocation, which performs poorly.

Table 2(b) shows the results of simulations for jobs
executing the One-to-All communication pattern.
Under the lighter traffic load induced by this pattern, the
contention effects seen in the experiments with
All-to-All are reduced. Again, MBS and Naive perform
best overall, showing only moderate dispersal and
contention. Contiguous allocation finishes last, taking
42% more time than MBS.

Table 2(c) shows the results of simulations for jobs 
executing the n-body communication pattern. The
packet blocking times show that contiguous strategies
have very little contention for this pattern, in
which almost all communication occurs between
adjacent neighbors when mapped by a row-major ordering.
For MBS and Naive, contention increases somewhat
but remains relatively low because
some degree of contiguity is maintained. This
allows the ring communication to still be executed
efficiently. The increased contention for MBS and Naive
allocation is not significant enough to outweigh the
improvements in system utilization. Random performs
much worse than any of the others since it cannot take
advantage of the regular ring communication in the
n-body pattern. Overall, MBS and Naive still finish faster than
either Random or contiguous allocation.
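
To illustrate why a ring pattern maps well onto a row-major block, the sketch below counts the mesh hops between consecutive ranks in a single rectangular block. It assumes a ring that wraps from the last rank back to the first and measures distance as Manhattan distance (the hop count of a minimal route); both are assumptions for illustration, and `ring_hops` is a hypothetical helper rather than part of the simulator.

```python
from typing import List


def ring_hops(width: int, height: int) -> List[int]:
    """Hop count between each rank and its ring successor.

    Ranks are laid out in row-major order in a width x height block,
    and rank n-1 is assumed to send to rank 0.
    """
    n = width * height
    coords = [(rank % width, rank // width) for rank in range(n)]
    hops = []
    for rank in range(n):
        x1, y1 = coords[rank]
        x2, y2 = coords[(rank + 1) % n]
        hops.append(abs(x1 - x2) + abs(y1 - y2))
    return hops


hops = ring_hops(4, 4)
adjacent_fraction = hops.count(1) / len(hops)
```

In a 4 x 4 block, 12 of the 16 ring links connect physically adjacent processors; only the row wrap-arounds and the closing link travel farther. A Random allocation offers no such structure, since consecutive ranks can land anywhere in the mesh.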

Tables 2(d) and 2(e) show the results of simulations
for jobs executing two communication patterns that
are well matched to the mesh topology of the target
machine. Due to restrictions imposed by the
communication pattern, all job request sizes were rounded to
the nearest power of two in these experiments.
Because both communication patterns are optimized to
perform best in a mesh allocation whose side lengths
are powers of two, they perform efficiently with
contiguous allocation. However, since MBS allocates
multiple such submesh blocks to each job, message passing
is also surprisingly efficient under this allocation
strategy. Therefore, with these highly mapping-sensitive
applications, MBS performs nearly as well as or better
than the contiguous strategies, and Naive and
Random allocation perform very poorly.

(a) All-to-All Broadcast

Algorithm    Finish Time    Average Packet Blocking Time    Weighted Dispersal
Random       326620         33.968                          42.037
MBS          273987         29.216                          26.717
Naive        232157         21.990                          14.832
First Fit    323343         21.154                          0

(b) One-to-All Broadcast

Algorithm    Finish Time    Average Packet Blocking Time    Weighted Dispersal
Random       54 S4          0.40980                         42.298
MBS          5045           0.36506                         27.002
Naive        5105           0.36700                         14.911
First Fit    7166           0.35001                         0

(c) n-Body

Algorithm    Finish Time    Average Packet Blocking Time    Weighted Dispersal
Random       26219          0.228657                        41.916
MBS          9044           0.013394                        29.956
Naive        8990           0.014407                        18.400
First Fit    11903          0.004326                        0

(d) 2D FFT

Algorithm    Finish Time    Average Packet Blocking Time    Weighted Dispersal
Random       2431           0.21896                         32.302
MBS          968            0.15387                         12.161
Naive        1352           0.19339                         14.470
First Fit    774            0.07494                         0

(e) NAS Multigrid Benchmark

Algorithm    Finish Time    Average Packet Blocking Time    Weighted Dispersal
Random       3132           0.21734                         31.826
MBS          1083           0.08051                         12.0389
Naive        1841           0.24005                         14.298
First Fit    1195           0.09228                         0

Table 2: Message-passing experiment results for the
five communication patterns.

From these message-passing experiments, it appears
that MBS and Naive allocation strategies
outperform both contiguous and Random non-contiguous
allocation, with higher system utilization and
increased job throughput, reflected in their faster
finishing times. They take advantage of the greater
flexibility offered by non-contiguous allocation while still
maintaining a degree of contiguity, as reflected in their
moderate dispersal values. The packet blocking times
indicate that this pays off in performance because
contention is reduced in comparison to Random allocation.
We would expect contention effects to be even
less significant in real parallel applications, where only
a portion of the total execution time is spent in
communication.
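
The finish-time comparisons drawn from Table 2 reduce to simple ratios, sketched below. The finish times are copied from Table 2 for MBS, Naive, and First Fit only; the dictionary layout and the helper `extra_time_pct` are hypothetical names introduced for this illustration.

```python
# Finish times from Table 2 (MBS, Naive, and First Fit only).
FINISH_TIME = {
    "All-to-All": {"MBS": 273987, "Naive": 232157, "First Fit": 323343},
    "One-to-All": {"MBS": 5045, "Naive": 5105, "First Fit": 7166},
    "n-Body": {"MBS": 9044, "Naive": 8990, "First Fit": 11903},
    "2D FFT": {"MBS": 968, "Naive": 1352, "First Fit": 774},
    "NAS Multigrid": {"MBS": 1083, "Naive": 1841, "First Fit": 1195},
}


def extra_time_pct(pattern: str, strategy: str, baseline: str = "MBS") -> float:
    """Percentage of additional finish time relative to the baseline strategy.

    A negative value means the strategy finished sooner than the baseline.
    """
    times = FINISH_TIME[pattern]
    return 100.0 * (times[strategy] / times[baseline] - 1.0)


one_to_all = extra_time_pct("One-to-All", "First Fit")
fft = extra_time_pct("2D FFT", "First Fit")
```

For One-to-All, First Fit takes about 42% more time than MBS, matching the figure quoted for Table 2(b). For 2D FFT the value is negative, which reflects that contiguous allocation finishes ahead of MBS on that mapping-sensitive pattern.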



6 Conclusions 

This paper investigates non-contiguous processor
allocation strategies as a method for improving
performance in message-passing multicomputers.
Contiguous allocation schemes suffer from low utilization due
to serious fragmentation problems, and experiments
have demonstrated that there is a limit to the amount
of improvement that can be achieved for contiguous
allocation.

We study three non-contiguous processor allocation 
strategies for mesh-based multicomputers and com- 
pare their performance with that of several well-known 
contiguous allocation schemes. To summarize our re- 
sults: 

• Non-contiguous allocation strategies
dramatically outperform contiguous allocation
strategies with respect to fragmentation.
As a result, system utilizations for non-contiguous
schemes reach as high as 77%, compared to
utilizations of 34% to 46% for contiguous schemes, when
message-passing contention is not considered.

• The non-contiguous allocation algorithms
perform better overall than the contiguous
ones, even when message-passing contention
is considered. The increased
contention due to non-contiguous allocation is not as
serious as the fragmentation effects of contiguous
allocation.

• Non-contiguous allocation strategies that
take advantage of non-contiguity while
providing some degree of contiguity exhibit
the best performance. The fully contiguous
First Fit algorithm and the fully non-contiguous
Random allocation algorithm exhibited the worst
performance, while the best performance was
achieved by the Multiple Buddy Strategy and
Naive allocation.

• Non-contiguous allocation is feasible on
present-day multicomputers with wormhole
routing. Current operating system
overhead on the Paragon XP/S-15 (Paragon OS
R1.1) subsumes the contention effects under
non-contiguous allocation. Even under improved
operating systems, the contention effects are
negligible for small messages (less than one kilobyte).
A sample workload at NASA NAS shows 87% of
all messages to be one kilobyte or less.

Our study shows that non-contiguous strategies
yield dramatic improvements in system performance
because they eliminate both internal and external
fragmentation. Furthermore, the amount of
contention introduced by non-contiguity can be limited
so that its effects on utilization and throughput are
minimized. We conclude that non-contiguous allocation
provides a new approach that will help highly parallel
systems achieve excellent price/performance ratios in
a high-demand, multi-user environment.